Objective: create a bar chart that shows the age distribution of respondents.
While this does not require complex transformations, the intent of this notebook is to showcase some exploratory data analysis and data cleaning to solve some common issues with numeric data, with a nice visualization at the end.
Here we import the necessary modules
import pandas as pd
from os import getcwd, path
import plotly.express as px
import plotly.offline as pyo
pyo.init_notebook_mode()
And load the CSV
path_to_data = path.join(getcwd(), "data", "survey_results_public.csv")
data = pd.read_csv(path_to_data)
data
| Respondent | MainBranch | Hobbyist | Age | Age1stCode | CompFreq | CompTotal | ConvertedComp | Country | CurrencyDesc | ... | SurveyEase | SurveyLength | Trans | UndergradMajor | WebframeDesireNextYear | WebframeWorkedWith | WelcomeChange | WorkWeekHrs | YearsCode | YearsCodePro | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | I am a developer by profession | Yes | NaN | 13 | Monthly | NaN | NaN | Germany | European Euro | ... | Neither easy nor difficult | Appropriate in length | No | Computer science, computer engineering, or sof... | ASP.NET Core | ASP.NET;ASP.NET Core | Just as welcome now as I felt last year | 50.0 | 36 | 27 |
| 1 | 2 | I am a developer by profession | No | NaN | 19 | NaN | NaN | NaN | United Kingdom | Pound sterling | ... | NaN | NaN | NaN | Computer science, computer engineering, or sof... | NaN | NaN | Somewhat more welcome now than last year | NaN | 7 | 4 |
| 2 | 3 | I code primarily as a hobby | Yes | NaN | 15 | NaN | NaN | NaN | Russian Federation | NaN | ... | Neither easy nor difficult | Appropriate in length | NaN | NaN | NaN | NaN | Somewhat more welcome now than last year | NaN | 4 | NaN |
| 3 | 4 | I am a developer by profession | Yes | 25.0 | 18 | NaN | NaN | NaN | Albania | Albanian lek | ... | NaN | NaN | No | Computer science, computer engineering, or sof... | NaN | NaN | Somewhat less welcome now than last year | 40.0 | 7 | 4 |
| 4 | 5 | I used to be a developer by profession, but no... | Yes | 31.0 | 16 | NaN | NaN | NaN | United States | NaN | ... | Easy | Too short | No | Computer science, computer engineering, or sof... | Django;Ruby on Rails | Ruby on Rails | Just as welcome now as I felt last year | NaN | 15 | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 64456 | 64858 | NaN | Yes | NaN | 16 | NaN | NaN | NaN | United States | NaN | ... | NaN | NaN | NaN | Computer science, computer engineering, or sof... | NaN | NaN | NaN | NaN | 10 | Less than 1 year |
| 64457 | 64867 | NaN | Yes | NaN | NaN | NaN | NaN | NaN | Morocco | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 64458 | 64898 | NaN | Yes | NaN | NaN | NaN | NaN | NaN | Viet Nam | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 64459 | 64925 | NaN | Yes | NaN | NaN | NaN | NaN | NaN | Poland | NaN | ... | NaN | NaN | NaN | NaN | Angular;Angular.js;React.js | NaN | NaN | NaN | NaN | NaN |
| 64460 | 65112 | NaN | Yes | NaN | NaN | NaN | NaN | NaN | Spain | NaN | ... | NaN | NaN | NaN | Computer science, computer engineering, or sof... | ASP.NET Core;jQuery | Angular;Angular.js;ASP.NET Core;jQuery | NaN | NaN | NaN | NaN |
64461 rows × 61 columns
We can keep only the columns needed for this notebook
data = data[["Age"]]
Note: we'd need a second column for counting if we were going to use aggregations to calculate the age frequencies.
Now that we have the data loaded, the next step is to process it.
To get Age fully ready for visualization, we need to complete four operations:
Let's explore the data with a scatter plot to see the ages reported.
# Plotly automatically assigns a X-axis coordinate to each age response
fig = px.scatter(data, y="Age")
fig.show()
The scatter plot shows that respondents close to 100 years or older may not exactly represent accurate data points. Furthermore, there is threshold at which children will be too young to be replying to the survey.
Additionally, Plotly visualizations are interactive so we can freely zoom in and out of the visual to analyse parts in greater detail.
With everything considered, we can filter the data keeping only respondents that are at least 10 years-old or younger than 76. Of course, there may have been respondents that were truly outside this range, but this seems a safe range considering the context.
data = data.query("(Age >= 10) and (Age <= 75)")
print(f"Rows left: {data.shape[0]:,}")
# Equivalent to
# data = data[(data["Age"] >= 10) & (data["Age"] <= 75)]
Rows left: 45,385
# Plotly automatically assigns a X-axis coordinate to each age response
fig = px.scatter(data, y="Age")
fig.show()
The age distribution seems cleaner.
There are two areas with slightly different distributions from the rest (before the 40k respondents and after the 60k), which may indicate an interesting pattern to delve deeper into or maybe just a coincidence from the way data is sorted. However we won't explore that here.
In the context of the complete dataset maybe you'd find an interesting insight, like only a certain age group was filling the survey in a short timeframe, but in this example our only objective is to arrive at that bar chart of ages.
Next we need to remove decimal ages. There are about twenty respondents with decimal ages, e.g. 15.5 years. This data issue can be dealt with using different approaches, like rounding up or down, replacing by the average age, ..., but we'll take the easy way out and simply remove them.
# Correct ages should be written as 15.0 or 37.0, so the integer representations are equivalent
is_integer = lambda row: int(row["Age"]) == row["Age"]
data = data[data.apply(is_integer, axis="columns")]
print(f"Rows left: {data.shape[0]:,}")
Rows left: 45,362
As with the decimal ages, blank responses can be resolved in multiple ways. Again, we'll simply remove them in this notebook.
data = data.dropna(axis="rows", how="any", subset=["Age"])
print(f"Rows left: {data.shape[0]:,}")
Rows left: 45,362
There actually weren't any blank ages, but I wanted to show anyway how to drop rows with blanks through different conditions. The way the function was called, it deletes rows that have any blank Age only, it ignores blanks in the other columns.
The call to dropna has more arguments than needed, as "rows" is the default axis value, and Respondent is always filled in because it is the id of the respondent in the original dataset, so you wouldn't need the subset argument. Plus, "any" is also the default how value.
In other words, the call for this example in specific could be just data.dropna(), but this way I was able to show alternatives that make this function more versatile.
The last step we need before creating the bar chart is to count the frequency of each age, i.e., the values for the Y axis of the chart.
value_counts returns a Series with the ages as indices and their frequencies as values, sorted from most to least frequent.
age_counts = data["Age"].value_counts()
If we pass the Series of frequencies to the bar function, it is enough to create the bar chart.
Plotly sorts the ages (X axis) automatically.
fig = px.bar(age_counts, title="Age of respondents")
fig.update_layout(
xaxis_title = "Age",
yaxis_title = "Frequency",
title_x = 0.5, # Center the title horizontally
showlegend = False
)
fig.show()